Abstract. Statistical models of unobserved heterogeneity are typically formalized as mixtures
of simple parametric models and interest naturally focuses on testing for homogeneity
versus general mixture alternatives. Many tests of this type can be interpreted as $C(\alpha)$
tests, as in Neyman (1959), and shown to be locally, asymptotically optimal. A unified
approach to analysing the asymptotic behavior of such tests will be described, employing
a variant of the LeCam LAN framework. These $C(\alpha)$ tests will be contrasted with a new
approach to likelihood ratio testing for mixture models. The latter tests are based on
estimation of general (nonparametric) mixture models using the Kiefer and Wolfowitz (1956)
maximum likelihood method. Recent developments in convex optimization are shown to
dramatically improve upon earlier EM methods for computation of these estimators, and
new results on the large sample behavior of likelihood ratios involving such estimators yield
a tractable form of asymptotic inference. We compare performance of the two approaches,
identifying circumstances in which each is preferred.



1. Introduction 

Given a simple parametric density model, $f(x,\vartheta)$, for iid observations, $X_1,\dots,X_n$, there
is a natural temptation to complicate the model by allowing the parameter, $\vartheta$, to vary
with $i$. In the absence of other, e.g. covariate, information that would distinguish the
observations from one another it may be justifiable to view the $\vartheta_i$'s as drawn at random.
Inference for such mixture models is complicated by a variety of problems, notably their
lack of identifiability. Two dominant approaches exist: Neyman's $C(\alpha)$ and the likelihood
ratio test (LRT). $C(\alpha)$ is particularly attractive for testing homogeneity against general forms of
heterogeneity for the parameter $\vartheta$; such tests have a relatively simple asymptotic theory, and
are generally easy to compute. The LRT, in contrast, is more easily adapted to compound
null hypotheses, but has a much more complicated limiting behavior, and is generally more
difficult to compute.

We will argue that recent developments in convex optimization have dramatically reduced 
the computational burden of the LRT approach for general, nonparametric alternatives. 
We will present a tractable large-sample theory for the LRT that conforms well to our 
simulation evidence, and exhibits both good size and power performance. Comparisons 
throughout with $C(\alpha)$, which is asymptotically locally optimal, demonstrate that the LRT
can be a highly effective complementary approach.

2. Likelihood Ratio Tests for Mixture Models

Lindsay (1995) offers a comprehensive overview of the vast literature on mixture models. 
He traces the idea of maximum likelihood estimation of a nonparametric mixing distribution 
F, given random samples from the mixture density, 

$$g(x) = \int \varphi(x,\vartheta)\,dF(\vartheta), \qquad (1)$$

to Robbins (1950). Kiefer and Wolfowitz (1956) filled in many details of the Robbins 
proposal and yet only with Laird (1978) did a viable computational strategy emerge for it. 
The EM method proposed by Laird has been employed extensively in subsequent work, e.g. 
Heckman and Singer (1984) and Jiang and Zhang (2009), even though it has been widely 
criticized for its slow convergence. Recently Koenker and Mizera (2012) have noted that the
Kiefer-Wolfowitz estimator can be formulated as a convex optimization problem and solved
very efficiently by interior point methods. Recent work by Liu and Shao (2003) and Azaïs,
Gassiat, and Mercadier (2009) has clarified the limiting behavior of the LRT for general 
classes of alternatives, and taken together these developments offer a fresh opportunity to 
explore the LRT for inference on mixtures. 

It seems ironic that many of the difficulties inherent in maximum likelihood estimation 
of finite parameter mixture models vanish when we consider nonparametric mixtures. The 
notorious multimodality of parametric likelihood surfaces is replaced by a much simpler, 
strictly convex optimization problem possessing a unique, unimodal solution. It is of obvious 
concern that consideration of such a wide class of alternatives may depress the power of 
associated tests; we will see that while there is some loss of power when compared to more 
restricted parametric LRTs, the loss is typically modest, a small price to pay for power 
against a broader class of alternatives. We will see that by comparison with $C(\alpha)$ tests that
are also designed to detect general alternatives, the LRT can be competitive. 



2.1. Maximum Likelihood Estimation of General Mixtures. Suppose that we have
iid observations, $X_1,\dots,X_n$, from the mixture density (1). The Kiefer-Wolfowitz MLE
requires us to solve,

$$\min_{F\in\mathcal{F}} \Big\{ -\sum_{i=1}^n \log g(x_i) \Big\},$$

where $\mathcal{F}$ is the (convex) set of all mixing distributions. The problem is one of minimizing the
sum of convex functions subject to linear equality and inequality constraints. The dual to
this (primal) convex program proves to be somewhat more tractable from a computational
viewpoint, and takes the form,

$$\max_{\nu} \Big\{ \sum_{i=1}^n \log \nu_i \;\Big|\; \sum_{i=1}^n \nu_i \varphi(x_i,\vartheta) \le n, \text{ for all } \vartheta \Big\}.$$

See Lindsay (1983) and Koenker and Mizera (2012) for further details. This variational
form of the problem may still seem rather abstract since it appears that we need to check
an infinite number of values of $\vartheta$ for each choice of the vector, $\nu$. However, it suffices in
applications to consider a fine grid of values $\{\vartheta_1,\dots,\vartheta_m\}$ and write the primal problem as

$$\min_{f\in\mathcal{S}} \Big\{ -\sum_{i=1}^n \log(g_i) \;\Big|\; Af = g \Big\},$$

where $A$ is an $n$ by $m$ matrix with elements $\varphi(x_i,\vartheta_j)$ and $\mathcal{S} = \{s \in \mathbb{R}^m \mid 1^\top s = 1,\ s \ge 0\}$
is the unit simplex. Thus, $\hat f_j$ denotes the estimated mixing density evaluated at the grid
point, $\vartheta_j$, and $\hat g_i$ denotes the estimated mixture density evaluated at $x_i$. The dual problem
in this discrete formulation becomes,

$$\max_{\nu} \Big\{ \sum_{i=1}^n \log \nu_i \;\Big|\; A^\top \nu \le n 1_m,\ \nu \ge 0 \Big\}.$$

Primal and dual solutions are immediately recoverable from the solution to either problem. 
Interior point methods such as those provided by PDCO of Saunders (2003) and Mosek of 
Andersen (2010), are capable of solving dual formulations of typical problems with n = 200 
and m = 300 in less than one second.
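
The discrete primal problem above can be illustrated with a few lines of code. The sketch below is not the interior point approach described in the text: it uses the slower EM fixed-point iteration (in the spirit of Laird, 1978) because that needs only the standard library. The Gaussian location kernel, the function names, and the choice to restrict the null fit to the same grid are all assumptions made for the illustration.

```python
import math

def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def mixture_loglik(A, f):
    """Log-likelihood sum_i log(g_i) with g = A f."""
    return sum(math.log(sum(a * w for a, w in zip(row, f))) for row in A)

def kw_npmle_em(x, grid, iters=300):
    """EM iterations for the grid-restricted Kiefer-Wolfowitz problem.

    Returns the mixing weights f on `grid` and the log-likelihood path.
    """
    n = len(x)
    # A[i][j] = phi(x_i - theta_j), the n-by-m matrix from the text
    # (Gaussian location kernel assumed here).
    A = [[normal_pdf(xi - t) for t in grid] for xi in x]
    f = [1.0 / len(grid)] * len(grid)  # uniform start on the unit simplex
    path = [mixture_loglik(A, f)]
    for _ in range(iters):
        g = [sum(a * w for a, w in zip(row, f)) for row in A]
        # EM update: f_j <- f_j * (1/n) * sum_i A_ij / g_i
        f = [w * sum(row[j] / gi for row, gi in zip(A, g)) / n
             for j, w in enumerate(f)]
        path.append(mixture_loglik(A, f))
    return f, path

def lrt_homogeneity(x, grid, iters=300):
    """2 * (mixture log-likelihood - best single-atom log-likelihood on the grid)."""
    _, path = kw_npmle_em(x, grid, iters)
    null = max(sum(math.log(normal_pdf(xi - t)) for xi in x) for t in grid)
    return 2.0 * (path[-1] - null)
```

Each EM step keeps the weights on the simplex and cannot decrease the log-likelihood, but convergence is slow when the solution has only a few atoms, which is the motivation for the convex programming formulation.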

Solutions to the nonparametric MLE problem of Kiefer and Wolfowitz produce estimates
of the mixing distribution, $F$, that are discrete and possess only a few mass points. A
theoretical upper bound of $n$ on the number of these atoms was established already by
Lindsay (1983), but in practice the number is observed to be far smaller. It may
seem surprising, perhaps even disturbing, that even when the true mixing distribution has
a smooth density, our estimate of that density is discrete with only a handful of atoms.
This may appear less worrying if we consider a more explicit example. Suppose that we 
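have a location mixture of Gaussians,

$$g(x) = \int \phi(x - \vartheta)\, dF(\vartheta), \qquad \phi \text{ the standard normal density},$$

so we are firmly in the deconvolution business, a harsh environment notorious for its poor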
convergence rates. One interpretation of this is that good approximations of the mixture 
density g can be achieved by relatively simple discrete mixtures with only a few atoms. For 
many applications estimation of g is known to be sufficient: this is quite explicit for example 
for empirical Bayes compound decision problems where Bayes rules depend entirely on the 
estimated $g$. Of course given our discrete formulation of the Kiefer-Wolfowitz problem, we
can only identify the location of atoms up to the scale of the grid spacing, but we believe that
the $m \approx 300$ grid points we have been using are probably adequate for most applications.

Given a reliable maximum likelihood estimator for the general nonparametric mixture 
model it is of obvious interest to know whether an effective likelihood ratio testing strategy 
can be developed. This question has received considerable prior attention; again, Lindsay
(1995) provides an authoritative overview of this literature. However, more recent work by
Liu and Shao (2003) and Azaïs, Gassiat, and Mercadier (2009) has revealed new features
of the asymptotic behavior of the likelihood ratio for mixture settings that enable one to 
derive asymptotic critical values for the LRT. 



The R empirical Bayes package REBayes, Koenker (2012), is available from the second author on request. 
It is based on the RMosek package of Friberg (2012), and was used for all of the computations reported below. 



2.2. Asymptotic Theory of Likelihood Ratios for General Mixtures. Consider the
general problem of testing that observed data come from a family of distributions $(F_\vartheta)_{\vartheta\in\Theta_1}$
against the alternative that the distribution is $F_\vartheta$ for some $\vartheta \in \Theta_2\setminus\Theta_1$ where we assume
$\Theta_1 \subset \Theta_2$. Liu and Shao (2003) provide tools that allow one to derive the limiting
distribution of the likelihood ratio test statistic under very general conditions. In particular, $\Theta_1, \Theta_2$
need not be subsets of $\mathbb{R}^d$ but are allowed to be subsets of general metric spaces.

Denote the "true" density of the data by $f$ and start by considering a point null hypothesis,
i.e. $H_0 : f = f_0$ for some density $f_0$ (note that we do not assume that there exists a unique
parameter $\vartheta$ corresponding to $f_\vartheta = f_0$, although in the setting considered here this will
in fact turn out to be the case). Denote by $\ell_\vartheta(x) := f_\vartheta(x)/f_0(x)$ the likelihood ratio and
consider the test statistic for $H_0$ against $H_1 : f = f_\vartheta \neq f_0$,

$$\sup_{\vartheta\in\Theta} L_n(\vartheta), \qquad L_n(\vartheta) := \sum_{i=1}^n \log \ell_\vartheta(X_i).$$

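Adapting Theorem 3.1 of Liu and Shao (2003) yields the following theorem.

Theorem 2.1. Define the classes of functions

$$\mathcal{F}_{\Theta,\varepsilon} := \Big\{ S_\vartheta := \frac{\ell_\vartheta - 1}{D(\vartheta)} \;\Big|\; 0 < D(\vartheta) \le \varepsilon \Big\}$$

and

$$\mathcal{F}_{\Theta,0} := \Big\{ S \in L^2 : \exists (\vartheta_n)_{n\in\mathbb{N}} \subset \Theta \text{ s.t. } D(\vartheta_n) = o(1),\ \Big\| \frac{\ell_{\vartheta_n} - 1}{D(\vartheta_n)} - S \Big\|_2 = o(1) \Big\}$$

with $D^2(\vartheta) := E[(\ell_\vartheta - 1)^2]$. Assume that the following three conditions hold:

(1) For sufficiently small $\varepsilon > 0$, the class $\mathcal{F}_{\Theta,\varepsilon}$ is Donsker.

(2) For any sequence $(\vartheta_n)_{n\in\mathbb{N}} \subset \Theta$ with $D(\vartheta_n) = o(1)$ there exists a subsequence
$(\vartheta_{n_k})_{k\in\mathbb{N}}$ and a function $S \in \mathcal{F}_{\Theta,0}$ with $\|S_{\vartheta_{n_k}} - S\|_2 = o(1)$.

(3) For any $S \in \mathcal{F}_{\Theta,0}$ there exists a path $\{\vartheta(t,S) : 0 < t \le \varepsilon\} \subset \Theta$ such that $t \mapsto \ell_{\vartheta(t,S)}$
is continuous with respect to the $L^2$ norm, $D(\vartheta(t,S)) > 0$ for all $t > 0$ and
$\lim_{t\to 0} S_{\vartheta(t,S)} = S$ in $L^2$.

Then

$$2 \sup_{\vartheta\in\Theta} L_n(\vartheta) \rightsquigarrow \Big( \sup_{S\in\mathcal{F}_{\Theta,0}} W_S \vee 0 \Big)^2$$

where $(W_S)_{S\in\mathcal{F}_{\Theta,0}}$ denotes a centered Gaussian process with covariance structure
$\mathrm{Cov}(W_f, W_g) = E[f(X)g(X)]$.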



Remark 2.2. The condition (2) is called completeness by Liu and Shao. The condition (3) 
is slightly different from the assumption called continuous sample paths in Definition 2.4 of 
Liu and Shao (2003). However, a closer look at the relevant proofs reveals that condition 
(3) is in fact sufficient.

The asymptotic distribution of likelihood ratio tests with composite null hypothesis, i.e.
the test $\vartheta \in \Theta_2$ versus $\vartheta \in \Theta_1$, is then given by

$$\Big( \sup_{S\in\mathcal{F}_{\Theta_2,0}} W_S \vee 0 \Big)^2 - \Big( \sup_{S\in\mathcal{F}_{\Theta_1,0}} W_S \vee 0 \Big)^2.$$

The two main challenges in applying this result thus consist in describing the classes $\mathcal{F}_{\Theta,0}$
and in verifying the technical assumptions. Let us start by considering a classical example
where the limiting distribution is known.

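Example 2.3. Assume that all parameters are $\mathbb{R}^d$-valued and identifiable, i.e. $\vartheta_1 = \vartheta_2$ if
and only if $f_{\vartheta_1} = f_{\vartheta_2}$, and that $D(\vartheta_n) = o(1)$ if and only if $\vartheta_n \to \vartheta_0$. Moreover, let $f_\vartheta$ be
"nice" (sufficiently smooth with respect to $\vartheta$ in a uniform sense). For simplicity, assume
that the true parameter is zero and that it is an interior point of $\Theta$. A Taylor expansion
then yields (the first equality holding in an $L^2$-sense)

$$\ell_{\vartheta_n} - 1 = \vartheta_n^\top \ell'_0 + o(\|\vartheta_n\|), \qquad D^2(\vartheta_n) = \vartheta_n^\top E[\ell'_0 (\ell'_0)^\top] \vartheta_n + o(\|\vartheta_n\|^2),$$

and the class of functions $\mathcal{F}_{\Theta,0}$ is identified as

$$\mathcal{F}_{\Theta,0} = \Big\{ \frac{v^\top \ell'_0}{(v^\top E[\ell'_0 (\ell'_0)^\top] v)^{1/2}} \;\Big|\; v \in S^{d-1} \Big\}.$$

Identify this class of functions with the sphere $S^{d-1}$. The covariance structure of the
Gaussian process $W$ is now given by

$$K(v,w) = \frac{v^\top E[\ell'_0 (\ell'_0)^\top] w}{(v^\top E[\ell'_0 (\ell'_0)^\top] v)^{1/2} (w^\top E[\ell'_0 (\ell'_0)^\top] w)^{1/2}}.$$

The seemingly formidable "Gaussian process" thus turns out to be merely a collection of
scaled linear combinations of a $d$-dimensional normal distribution. More precisely, its
distribution coincides with that of $(Z_v)_{v\in S^{d-1}}$ where

$$Z_v = \frac{v^\top \tilde Y}{(v^\top E[\ell'_0 (\ell'_0)^\top] v)^{1/2}}, \qquad \tilde Y \sim \mathcal{N}(0, E[\ell'_0 (\ell'_0)^\top]).$$

Finally, writing $\tilde Y = E[\ell'_0 (\ell'_0)^\top]^{1/2} Y$ with $Y \sim \mathcal{N}(0, I_d)$ yields the expected $\chi^2_d$ distribution

$$\Big( \sup_{S\in\mathcal{F}_{\Theta,0}} W_S \vee 0 \Big)^2 \sim Y_1^2 + \dots + Y_d^2 \sim \chi^2_d.$$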
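
The final step of the example rests on the Cauchy-Schwarz identity $\sup_v (v^\top \tilde y)_+^2 / (v^\top \Sigma v) = \tilde y^\top \Sigma^{-1} \tilde y$, which is what turns the supremum into a $\chi^2_d$ variable. A small numerical check for $d = 2$ is sketched below; the function names and the brute-force search over angles are illustrative assumptions, not part of the original argument.

```python
import math

def sup_scaled_projection(y, sigma, steps=3600):
    """Grid search over unit vectors v in R^2 of (v'y)_+^2 / (v' Sigma v)."""
    best = 0.0
    for k in range(steps):
        a = 2.0 * math.pi * k / steps
        v0, v1 = math.cos(a), math.sin(a)
        num = max(v0 * y[0] + v1 * y[1], 0.0) ** 2
        den = (sigma[0][0] * v0 * v0
               + 2.0 * sigma[0][1] * v0 * v1
               + sigma[1][1] * v1 * v1)
        best = max(best, num / den)
    return best

def quadratic_form_inverse(y, sigma):
    """Closed form y' Sigma^{-1} y for a symmetric 2-by-2 matrix Sigma."""
    det = sigma[0][0] * sigma[1][1] - sigma[0][1] ** 2
    return (sigma[1][1] * y[0] ** 2
            - 2.0 * sigma[0][1] * y[0] * y[1]
            + sigma[0][0] * y[1] ** 2) / det
```

For any positive definite `sigma` the grid value approaches the closed form from below as `steps` grows.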



Let us now turn to the more specialized case of testing homogeneity against arbitrary 
mixtures, i.e. 



$$H_0 : f = f_\mu \text{ for some } \mu \in M \quad \text{against} \quad H_1 : f(x) = \int_M f_\mu(x)\,dG(\mu),\ G \in P_M \setminus P_M^{(1)},$$

where $P_M$ denotes the set of probability measures on $M$, and $P_M^{(k)}$ the set of distribution
functions that have exactly $k$ mass points. Under $H_0$, there exists a $\mu_0 \in M$ with $f = f_{\mu_0}$.
The parameter set can now be identified with the set of measures $P_M$, which are identified
with their distribution functions. The symbols $f_G, \ell_G$ will be used to denote the mixture
density and likelihood ratio corresponding to the distribution $G$. Abusing notation, we will
use the symbol $\ell_\mu$ to denote both the function $x \mapsto f_\mu(x)/f_{\mu_0}(x)$ and the quantity $\ell_{\delta_\mu}$, with
$\delta_\mu$ denoting the distribution with point mass at $\mu$.

In principle, the results of Liu and Shao (2003) can be applied in this situation under
rather general conditions on the mixture and mixing distributions. A closely related
approach was recently taken by Azaïs, Gassiat, and Mercadier (2009), who derive the
distribution of the likelihood ratio test for a single distribution against arbitrary mixtures under
fairly general conditions.

Example 2.4. Consider mixtures of $\mathcal{N}(\mu, 1)$ distributions and assume that $M = [L,U]$
with $0 \in M$. According to Theorem 3 in Azaïs, Gassiat, and Mercadier (2009), the
asymptotic distribution of the log-likelihood ratio test statistic

$$2\Big( \sup_{G\in P_M} \sum_{i=1}^n \log \ell_G(X_i) - \sup_{\mu\in M} \sum_{i=1}^n \log \ell_\mu(X_i) \Big)$$

under the null of $X_i \sim \mathcal{N}(0,1)$ i.i.d. is given by

$$D = \Big( \sup_{G\in P_M} (V_G)_+ \Big)^2 - Y_1^2,$$

where $(V_G)_{G\in P_M}$ is the Gaussian process given by

$$V_G := \Big( \sum_{k=1}^\infty \frac{Y_k \kappa_k(G)}{\sqrt{k!}} \Big) \Big/ \Big( \sum_{k=1}^\infty \frac{\kappa_k^2(G)}{k!} \Big)^{1/2}$$

with $Y_1, Y_2, \dots$ denoting i.i.d. $\mathcal{N}(0,1)$ distributed random variables, $\kappa_k(G) := \int_M \mu^k\,dG(\mu)$
and $x_+$ denoting the positive part of $x$. As we will show in the appendix, there is a simpler
expression for the distribution of $D$. More precisely, we will demonstrate that

$$D = \sup_{G\in P_M,\ G\neq\delta_0} \frac{\big( \sum_{k=2}^\infty Y_k \kappa_k(G)/\sqrt{k!} \big)_+^2}{\sum_{k=2}^\infty \kappa_k^2(G)/k!}.$$

Approximating the distribution function $G$ on $M$ by a discrete distribution function with
masses $p_1,\dots,p_N$ on a fine grid $m_1,\dots,m_N$ leads to the approximation

$$D \approx \sup_{p} \frac{\big( \sum_{k=2}^\infty Y_k \sum_{i=1}^N p_i m_i^k/\sqrt{k!} \big)_+^2}{\sum_{k=2}^\infty \big( \sum_{i=1}^N p_i m_i^k \big)^2/k!}.$$

In particular, maximizing the right-hand side with respect to $p_1,\dots,p_N$ under the constraints
$p_i \ge 0$, $\sum_i p_i = 1$ for fixed grid $m_1,\dots,m_N$ can be formulated as a quadratic optimization
problem of the form

$$\min_p\ p^\top A p \quad \text{under} \quad p_i \ge 0,\ p^\top b = 1,$$

where $p = (p_1,\dots,p_N)$, $A_{ij} = \sum_{k=2}^\infty (m_i m_j)^k/k!$, $b_i = \sum_{k=2}^\infty Y_k m_i^k/\sqrt{k!}$, if $\max_i b_i > 0$. If $\max_i b_i \le 0$,
we can set $D = 0$. This suggests a practical way of simulating critical values after replacing
the infinite sum by a finite approximation and avoiding the grid point 0. The table below
contains simulated critical values in some particular settings. All results are based on 10,000
simulation runs with the sums for $A$ and $b$ cut off at $k = 25$ and grids of 200 equally
spaced points excluding the point 0.

M          90%     95%     99%
[-1,1]     2.75    3.95    6.93
[-2,2]     3.90    5.37    8.71
[-3,3]     5.34    6.87   10.46
[-4,4]     6.38    8.32   11.91

Table 1. Simulated asymptotic critical values for the asymptotic null
distribution for various sets M.
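
The ingredients of the quadratic program in Example 2.4 are easy to assemble. The sketch below builds the truncated $A$ and $b$ for one draw of $Y_2,\dots,Y_{25}$ and evaluates the ratio $(p^\top b)_+^2 / (p^\top A p)$ at point masses only, which gives a lower bound on the discretized supremum rather than the supremum itself; a full solution would need a quadratic programming solver, which is not shown. Function names and the midpoint grid are assumptions for the illustration.

```python
import math
import random

def build_A_b(grid, y, kmax=25):
    """Truncated A_ij = sum_{k=2}^{kmax} (m_i m_j)^k / k! and
    b_i = sum_{k=2}^{kmax} Y_k m_i^k / sqrt(k!).

    y[k] holds Y_k for k = 2..kmax; entries 0 and 1 are unused.
    """
    fact = [math.factorial(k) for k in range(kmax + 1)]
    A = [[sum((mi * mj) ** k / fact[k] for k in range(2, kmax + 1))
          for mj in grid] for mi in grid]
    b = [sum(y[k] * mi ** k / math.sqrt(fact[k]) for k in range(2, kmax + 1))
         for mi in grid]
    return A, b

def single_atom_lower_bound(A, b):
    """max_i (b_i)_+^2 / A_ii: the ratio evaluated at point masses only."""
    return max(max(bi, 0.0) ** 2 / A[i][i] for i, bi in enumerate(b))

# One simulated draw on a grid of 200 midpoints of [-1, 1]; 0 is not a grid point.
rng = random.Random(0)
y_draw = [0.0, 0.0] + [rng.gauss(0.0, 1.0) for _ in range(24)]
grid_200 = [-1.0 + (2 * i + 1) / 200 for i in range(200)]
A_demo, b_demo = build_A_b(grid_200, y_draw)
lower_bound_demo = single_atom_lower_bound(A_demo, b_demo)
```

Repeating the draw many times and replacing the single-atom bound by the solution of the quadratic program would mimic the simulation that produced Table 1.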



3. Neyman C(a) Tests for Mixture Models 

Neyman's $C(\alpha)$ tests can be viewed as an expanded class of Rao (score) tests that
accommodate general methods of estimation for nuisance parameters. In regular likelihood
settings $C(\alpha)$ tests are constructed from the usual score components. Suppose we have
iid $X_1, X_2, \dots, X_n$ from the density $\varphi(x,\vartheta,\xi)$, and we would like to test the hypothesis,
$H_0 : \xi = \xi_0$ on a $p$ dimensional parameter versus the alternative $H_1 : \xi \neq \xi_0$. Given a
$\sqrt{n}$-consistent estimator, $\hat\vartheta_n$, of the nuisance parameter we will denote by

$$C_{\xi,n} = n^{-1/2} \sum_{i=1}^n \nabla_\xi \log \varphi(X_i,\vartheta,\xi)\Big|_{\xi=\xi_0}, \qquad C_{\vartheta,n} = n^{-1/2} \sum_{i=1}^n \nabla_\vartheta \log \varphi(X_i,\vartheta,\xi_0)$$

the score vectors with respect to $\xi$ and $\vartheta$ respectively. Following Akritas (1988) and Chibisov
(1973) the $C(\alpha)$ test of $H_0$ can be viewed (asymptotically) as a conditional test in the
limiting Gaussian experiment. In the limit experiment, $(C_{\xi,n}, C_{\vartheta,n})$ are jointly Gaussian
with Fisher information covariance matrix,

$$I = \begin{pmatrix} I_{\xi\xi} & I_{\xi\vartheta} \\ I_{\vartheta\xi} & I_{\vartheta\vartheta} \end{pmatrix}.$$

The conditional test of $H_0$ based on a single observation from this limit distribution depends
on the vector,

$$g_n = C_{\xi,n} - I_{\xi\vartheta} I_{\vartheta\vartheta}^{-1} C_{\vartheta,n},$$

that has covariance matrix, $I_{\xi\xi} - I_{\xi\vartheta} I_{\vartheta\vartheta}^{-1} I_{\vartheta\xi}$, which is the inverse of the $\xi\xi$-block, $I^{\xi\xi}$, of the inverse,
$I^{-1}$, of the full Fisher information matrix. Thus, the $C(\alpha)$ test statistic, $T_n = g_n^\top I^{\xi\xi} g_n$, is
asymptotically $\chi^2_p$, and is locally optimal for alternatives of the form, $\xi_n = \xi_0 + \delta/\sqrt{n}$. The
salient practical advantage of $T_n$ lies in the option to use any $\sqrt{n}$-consistent estimator for
$\hat\vartheta_n$. When $\hat\vartheta_n$ is the maximum likelihood estimator of $\vartheta$ under the $\xi = \xi_0$ constraint the
$C(\alpha)$ procedure reverts to the Rao score test.

Regularity conditions for the foregoing results were originally given by Neyman (1959)
and extended by Bühler and Puri (1966) as variants of the classical Cramér conditions.
An alternative formulation can be constructed from the differentiability in quadratic mean
(DQM) condition of LeCam (1970). The latter approach, as discussed in more detail in Gu
(2012), seems to be more appropriate for the consideration of $C(\alpha)$ testing for mixtures.

$C(\alpha)$ tests for heterogeneity in mixture models typically take a simple form although their
theory requires some substantial amendment from the regular cases we have just described.
Suppose we have random variables $\{X_1,\dots,X_n\}$ with $X_i \sim \varphi(x,\vartheta_i)$ and the $\vartheta_i$'s are given
by

$$\vartheta_i = \vartheta + \tau\xi U_i$$

where the $U_i$ are iid with distribution function, $F$, with $E(U) = 0$ and $V(U) = 1$. The
parameter $\tau$ denotes a known scale parameter, and we are interested in testing the null
hypothesis, $H_0 : \xi = 0$. Under these circumstances it can be easily seen that the usual
score test procedure breaks down, because the first logarithmic derivative of the density
with respect to $\xi$ is identically zero under the null,

$$\frac{\partial}{\partial\xi} \log \int \varphi(x,\vartheta + \tau\xi u)\,dF(u)\Big|_{\xi=0} = \tau \int u\,dF(u) \cdot \varphi'(x,\vartheta)/\varphi(x,\vartheta) = 0,$$

and consequently the usual Fisher information about $\xi$ is zero. All is not lost: as Neyman
was already aware, we can simply differentiate the log likelihood once again,

$$\frac{\partial^2}{\partial\xi^2} \log \int \varphi(x,\vartheta + \tau\xi u)\,dF(u)\Big|_{\xi=0}
= \tau^2 \int u^2\,dF(u) \cdot \varphi''(x,\vartheta)/\varphi(x,\vartheta) - \Big( \tau \int u\,dF(u) \cdot \varphi'(x,\vartheta)/\varphi(x,\vartheta) \Big)^2
= \tau^2 \varphi''(x,\vartheta)/\varphi(x,\vartheta).$$

This second-order score function replaces the familiar first-order one and provides an
analogue of Fisher information for $C(\alpha)$ parameter heterogeneity inference.
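
As an illustration that is not part of the original text, suppose $\varphi(x,\vartheta)$ is the Poisson density with mean $\vartheta$, and write primes for derivatives with respect to $\vartheta$. Under that assumption the second-order score reduces to the familiar overdispersion contrast:

$$\frac{\varphi'(x,\vartheta)}{\varphi(x,\vartheta)} = \frac{x}{\vartheta} - 1, \qquad \frac{\varphi''(x,\vartheta)}{\varphi(x,\vartheta)} = \Big( \frac{x}{\vartheta} - 1 \Big)^2 - \frac{x}{\vartheta^2} = \frac{(x-\vartheta)^2 - x}{\vartheta^2}.$$

Under homogeneity $E(X-\vartheta)^2 = EX = \vartheta$, so this score has mean zero; heterogeneity inflates the variance relative to the mean and pushes it upward.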

3.1. Asymptotic Theory for $C(\alpha)$ Tests of Parameter Heterogeneity. The locally
asymptotic normal (LAN) apparatus of LeCam can be brought to bear to establish the
large sample behavior of the $C(\alpha)$ test. We will sketch the argument in the simplest scalar
parameter case, referring the reader to Gu (2012) for further details.

Let $\{X_1,\dots,X_n\}$ be a random sample from the density $p(x|\xi,\vartheta)$, with respect to the
measure, $\mu$. We would like to test the composite null hypothesis, $H_0 : \xi = \xi_0 \in \Xi \subset \mathbb{R}$ in
the presence of the nuisance parameter, $\vartheta \in \Theta \subset \mathbb{R}^p$, against $H_1 : \xi \in \Xi \setminus \{\xi_0\}$.

Assumption 3.1. The density function $p$ satisfies the following conditions:

(i) $\xi_0$ is an interior point of $\Xi$.

(ii) For all $\vartheta \in \Theta$ and $\xi \in \Xi$, the density is twice continuously differentiable with respect
to $\xi$ and once continuously differentiable with respect to $\vartheta$ for all $x$.

(iii) Denoting the first two derivatives of the density with respect to $\xi$ evaluated under
the null as $\nabla_\xi p(x|\xi_0,\vartheta)$ and $\nabla^2_\xi p(x|\xi_0,\vartheta)$, we have $P(\nabla_\xi p(x|\xi_0,\vartheta) = 0) = 1$ and
$P(\nabla^2_\xi p(x|\xi_0,\vartheta) \neq 0) > 0$.

(iv) Denoting the derivative of the density with respect to $\vartheta$ evaluated under the null as
$\nabla_\vartheta p(x|\xi_0,\vartheta)$, for any $p$-dimensional vector $a$, $P(\nabla^2_\xi p(x|\xi_0,\vartheta) \neq a^\top \nabla_\vartheta p(x|\xi_0,\vartheta)) > 0$.



GU, KOENKER AND VOLGUSHEV 






The crucial additional requirement is that p satisfies the following modified version of 
LeCam's differentiability in quadratic mean (DQM) condition. The LeCam approach has 
two salient advantages under the present circumstances: it avoids making superfluous fur- 
ther differentiability assumptions, and it removes any need for the symmetry assumption 
on the distribution of the heterogeneity that frequently appears in earlier examples of such 
tests. 

Definition 3.2. The density $p(x|\xi,\vartheta)$ satisfies the modified differentiability in quadratic
mean condition at $(\xi_0,\vartheta)$ if there exists a vector $v(x) = (v_1(x), v_2(x)) \in L^2(\mu)$ such that as
$(\xi_n,\vartheta_n) \to (\xi_0,\vartheta)$,

$$\int \Big| \sqrt{p(x|\xi_n,\vartheta_n)} - \sqrt{p(x|\xi_0,\vartheta)} - h_n^\top v(x) \Big|^2 \mu(dx) = o(\|h_n\|^2)$$

where $h_n = ((\xi_n - \xi_0)^2, (\vartheta_n - \vartheta)^\top)^\top$. Let $\beta(h_n)$ be the mass of the part of $p(x|\xi_n,\vartheta_n)$ that
is $p(x|\xi_0,\vartheta)$-singular, then as $(\xi_n,\vartheta_n) \to (\xi_0,\vartheta)$, $\beta(h_n)/\|h_n\|^2 \to 0$.

The second-order score function with respect to $\xi$ implies that the corresponding term in
$h_n$ is quadratic, and this in turn implies the $O(n^{-1/4})$ rate for the local alternative in the
following theorem.
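
The step from the quadratic term to the rate can be spelled out as follows, assuming (as is usual for local asymptotic normality) that local alternatives are taken with $\|h_n\|$ of order $n^{-1/2}$:

$$\xi_n = \xi_0 + \delta_1 n^{-1/4},\ \ \vartheta_n = \vartheta + \delta_2 n^{-1/2} \quad \Longrightarrow \quad h_n = \big( (\xi_n - \xi_0)^2, (\vartheta_n - \vartheta)^\top \big)^\top = n^{-1/2} (\delta_1^2, \delta_2^\top)^\top.$$

Only $\delta_1^2$ enters the local parameter, which is why the resulting test is one-sided.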



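Theorem 3.3. Suppose $(X_1,\dots,X_n)$ are iid with density $p$ satisfying Assumption 3.1 and
the modified DQM condition with,

$$v(x) = (v_1(x), v_2^\top(x))^\top = \Big( \frac{1}{4} \frac{\nabla^2_\xi p(x|\xi_0,\vartheta)}{\sqrt{p(x|\xi_0,\vartheta)}} 1_{[p(x|\xi_0,\vartheta)>0]},\ \frac{1}{2} \frac{\nabla_\vartheta p(x|\xi_0,\vartheta)^\top}{\sqrt{p(x|\xi_0,\vartheta)}} 1_{[p(x|\xi_0,\vartheta)>0]} \Big)^\top.$$

Denote the joint distribution of the $X_i$'s by $P_{n,\xi,\vartheta}$. Then for fixed $\delta_1$ and $\delta_2$, the log-likelihood
ratio has the following quadratic approximation under the null:

$$\Lambda_n = \log \frac{dP_{n,\xi_0+\delta_1 n^{-1/4},\vartheta+\delta_2 n^{-1/2}}}{dP_{n,\xi_0,\vartheta}} = t^\top S_n - \frac{1}{2} t^\top J t + o_P(1)$$

where $t = (\delta_1^2, \delta_2^\top)^\top$,

$$S_n = \begin{pmatrix} S_{1n} \\ S_{2n} \end{pmatrix} = \begin{pmatrix} \frac{2}{\sqrt{n}} \sum_{i=1}^n \frac{v_1(X_i)}{\sqrt{p(X_i|\xi_0,\vartheta)}} \\ \frac{2}{\sqrt{n}} \sum_{i=1}^n \frac{v_2(X_i)}{\sqrt{p(X_i|\xi_0,\vartheta)}} \end{pmatrix}$$

and

$$J = 4 \int (v v^\top)\,\mu(dx) = \begin{pmatrix} E(S_{1n}^2) & \mathrm{cov}(S_{1n}, S_{2n}^\top) \\ \mathrm{cov}(S_{1n}, S_{2n}) & E(S_{2n} S_{2n}^\top) \end{pmatrix} \equiv \begin{pmatrix} J_{11} & J_{12} \\ J_{21} & J_{22} \end{pmatrix}.$$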



Given that the conditions for the log likelihood ratio expansion are met, local asymptotic
optimality and the distribution of the $C(\alpha)$ test statistic under appropriate local alternatives
follow.

Theorem 3.4. Let $\mathcal{E}_n$ be a sequence of experiments based on iid random variables $(X_1,\dots,X_n)$
with joint distribution $P_{n,\xi_0+\delta_1 n^{-1/4},\vartheta+\delta_2 n^{-1/2}}$ indexed by $t = (\delta_1^2, \delta_2^\top)^\top \in \mathbb{R}_+ \times \mathbb{R}^p$, such that
the log-likelihood ratio satisfies,

$$\log \Big( \frac{dP_{n,\xi_0+\delta_1 n^{-1/4},\vartheta+\delta_2 n^{-1/2}}}{dP_{n,\xi_0,\vartheta}} \Big) = t^\top S_n - \frac{1}{2} t^\top J t + o_P(1),$$

with the sequence $S_n$ converging in distribution under the null to $\mathcal{N}(0, J)$. Then the sequence
$\mathcal{E}_n$ converges to the limit experiment based on observing one sample from $Y = t + v$ where
$v \sim \mathcal{N}(0, J^{-1})$. The locally asymptotically optimal statistic for testing, $H_0 : \delta_1 = 0$ vs.
$H_1 : \delta_1 \neq 0$ is

$$Z_n = (J_{11} - J_{12} J_{22}^{-1} J_{21})^{-1/2} (S_{1n} - J_{12} J_{22}^{-1} S_{2n}).$$

It has distribution $\mathcal{N}(0,1)$ under $H_0$, and distribution $\mathcal{N}(\delta_1^2 (J_{11} - J_{12} J_{22}^{-1} J_{21})^{1/2}, 1)$ under
$H_1$. We reject $H_0$ if $(0 \vee Z_n)^2 > c_\alpha$ with $c_\alpha$, the $(1 - \alpha)$ quantile of the $\frac{1}{2}\chi^2_0 + \frac{1}{2}\chi^2_1$ distribution.
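
The critical value of this mixture distribution has a simple closed form. A short derivation, with $\Phi$ denoting the standard normal distribution function (notation introduced here for the illustration):

$$P_{H_0}\big( (0 \vee Z_n)^2 > c \big) = P_{H_0}(Z_n > \sqrt{c}) \to 1 - \Phi(\sqrt{c}), \quad c > 0, \qquad \text{so} \qquad c_\alpha = \big( \Phi^{-1}(1-\alpha) \big)^2 \ \text{ for } \alpha < 1/2.$$

For example, $\alpha = 0.05$ gives $c_\alpha \approx 1.645^2 \approx 2.71$, compared with $3.84$ for the two-sided $\chi^2_1$ test.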

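Remark 3.5. The behavior of the test statistic under the specified contiguous alternatives
follows from LeCam's third lemma. Under the null, we have

$$\begin{pmatrix} Z_n \\ \Lambda_n \end{pmatrix} \rightsquigarrow \mathcal{N}\Big( \begin{pmatrix} 0 \\ -\frac{1}{2} t^\top J t \end{pmatrix}, \begin{pmatrix} 1 & \sigma_{12} \\ \sigma_{12} & t^\top J t \end{pmatrix} \Big)$$

with $\sigma_{12} = \mathrm{cov}(Z_n, \Lambda_n) = \delta_1^2 (J_{11} - J_{12} J_{22}^{-1} J_{21})^{1/2}$. LeCam's third lemma then implies that
under the local alternative with $\xi_n = \xi_0 + \delta_1 n^{-1/4}$ and $\vartheta_n = \vartheta + \delta_2 n^{-1/2}$,

$$Z_n \rightsquigarrow \mathcal{N}(\sigma_{12}, 1).$$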
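
The covariance $\sigma_{12}$ can be checked directly from the expansion $\Lambda_n = t^\top S_n - \frac{1}{2} t^\top J t + o_P(1)$, using only that the limiting covariance matrix of $S_n$ is $J$:

$$\mathrm{cov}(Z_n, t^\top S_n) = (J_{11} - J_{12} J_{22}^{-1} J_{21})^{-1/2} \big[ \delta_1^2 (J_{11} - J_{12} J_{22}^{-1} J_{21}) + (J_{12} - J_{12} J_{22}^{-1} J_{22}) \delta_2 \big] = \delta_1^2 (J_{11} - J_{12} J_{22}^{-1} J_{21})^{1/2}.$$

The term in $\delta_2$ vanishes, so local perturbations of the nuisance parameter do not shift the limiting mean of $Z_n$.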

Note that the test statistic $Z_n$ is a function of $\vartheta$. All of the results above hold if the true
nuisance parameter $\vartheta$ is used in the test statistic, which is infeasible. In practice, we can
plug in a consistent estimator for $\vartheta$, say $\hat\vartheta$. In order for the preceding results to be useful,
we need to ensure that $Z_n(\hat\vartheta) - Z_n(\vartheta) = o_P(1)$. There are various ways to obtain this kind
of result. The classical approach by Neyman (1959) was to make additional differentiability
and boundedness assumptions on the function $g$, which is defined as

$$g(x,\vartheta) = (J_{11} - J_{12} J_{22}^{-1} J_{21})^{-1/2} \Big( \frac{2 v_1(x)}{\sqrt{p(x|\xi_0,\vartheta)}} - J_{12} J_{22}^{-1} \frac{2 v_2(x)}{\sqrt{p(x|\xi_0,\vartheta)}} \Big)$$

such that $Z_n(\vartheta) = \frac{1}{\sqrt{n}} \sum_i g(X_i,\vartheta)$. Details of these assumptions can be found in Neyman
(1959, Definition 3). The assumptions are rather strong since they require the density to
be three times differentiable with respect to $\vartheta$. Another approach, however, is to view the
difference $Z_n(\hat\vartheta) - Z_n(\vartheta)$ as an empirical process. More precisely, if the following condition
holds, we can show the difference goes to zero in probability. Details can be found in Gu
(2012).

Assumption 3.6. Assume that for every $\vartheta \in \Theta$ there exists some $\delta > 0$ such that for any
$\eta, \eta' \in U_\delta(\vartheta)$ we have for some $\gamma > 0$

$$|g(x,\eta) - g(x,\eta')| \le \|\eta - \eta'\|^\gamma H(x)$$

for $P_{n,\xi_n,\vartheta}$-almost all $x$ (for every $n \in \mathbb{N}$) where $H$ is square integrable with respect to
$P_{n,\xi_n,\vartheta}$ for all $n \in \mathbb{N}$, $\sup_n E_{P_{n,\xi_n,\vartheta}} H^2(X) < \infty$ and additionally for some $c_n = o(1)$



$$n^{1/2} E_{P_{n,\xi_n,\vartheta}} \big[ H(X) I\{H(X) > n^{1/2} c_n\} \big] = o(1).$$



Theorem 3.7. Under Assumptions 3.1 and 3.6 and the DQM condition, if $\hat\vartheta$ is a consistent
estimator for $\vartheta$ then

$$|Z_n(\hat\vartheta) - Z_n(\vartheta)| = o_P(1).$$



Example 3.8. Consider testing for homogeneity in the Gaussian location mixture model
with independent observations $X_i \sim \mathcal{N}(\vartheta_i, 1)$, $i = 1,\dots,n$. Assume that $\vartheta_i = \vartheta_0 + \tau\xi U_i$,
for known $\tau$, and iid $U_i \sim F$ with $\mathbb{E}U = 0$ and $\mathbb{V}U = 1$. We would like to test $H_0 :
\xi = 0$ with the location parameter $\vartheta_0$ treated as a nuisance parameter. The second-order
score for $\xi$ is found to be, $\nabla^2_\xi \log \varphi(x,\vartheta_0,\xi=0) = \tau^2((x - \vartheta_0)^2 - 1)$ and the
first-order score for $\vartheta_0$ is, $\nabla_\vartheta \log \varphi(x,\vartheta_0,\xi=0) = (x - \vartheta_0)$. Note that under the null,
$J_{12} = \mathrm{cov}(\nabla^2_\xi \log \varphi(X,0,\vartheta_0), \nabla_\vartheta \log \varphi(X,0,\vartheta_0)) = 0$. Thus, we have the locally
asymptotically optimal $C(\alpha)$ test as

$$Z_n = \frac{1}{\sqrt{2n}} \sum_{i=1}^n \big( (X_i - \vartheta_0)^2 - 1 \big).$$

The obvious estimate for the nuisance parameter is the sample mean, and we reject the null
hypothesis when $(0 \vee Z_n)^2 > c_\alpha$.
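
The test in this example takes only a few lines to compute. The sketch below plugs in the sample mean for the nuisance parameter and takes the critical value to be the squared upper normal quantile; the function name and return convention are assumptions made for the illustration.

```python
import math
from statistics import NormalDist, fmean

def calpha_gaussian(x, alpha=0.05):
    """C(alpha) test of homogeneity for a unit-variance Gaussian location model.

    Returns (z, stat, reject) with z = (2n)^{-1/2} * sum((x_i - xbar)^2 - 1),
    stat = (0 v z)^2 and reject = stat > c_alpha.
    """
    n = len(x)
    xbar = fmean(x)  # sample mean as the nuisance parameter estimate
    z = sum((xi - xbar) ** 2 - 1.0 for xi in x) / math.sqrt(2.0 * n)
    crit = NormalDist().inv_cdf(1.0 - alpha) ** 2  # squared normal quantile
    stat = max(z, 0.0) ** 2
    return z, stat, stat > crit
```

Because only positive values of `z` lead to rejection, underdispersed samples never reject, in line with the one-sided alternative.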



4. Some Simulation Evidence 

To explore the finite sample performance of the methods we have discussed we begin with 
an experiment to compare the critical values of the LRT of homogeneity in the Gaussian 
location model with the simulated asymptotic critical values of Table 1. We consider sample
sizes, $n \in \{100, 500, 1000, 5000, 10000\}$ and four choices of the domain on which the MLE of the
mixing distribution is estimated: $\{[-j, j] : j = 1,\dots,4\}$. We maintain a grid spacing of 0.01 for the
mixing distribution on these domains for each of these cases for the Kiefer-Wolfowitz MLE.
Results are reported in Table 2. For the two largest sample sizes, 5,000 and 10,000, we bin the observations
into 300 and 500 equally spaced bins respectively. It will be noted that the empirical
critical values are consistently smaller than those simulated from the asymptotic theory. 
There appears to be a tendency for the empirical critical values to increase with n, but this 
tendency is rather weak. This finding is perhaps not entirely surprising in view of the slow 
rates of convergence established elsewhere in the literature, see e.g. Bickel and Chernoff 
(1993) and Hall and Stewart (2005). 

To compare power of the $C(\alpha)$ and LRT to detect heterogeneity in the Gaussian location
model we conducted four distinct experiments. Two were based on variants of the Chen
(1995) example with the discrete mixing distribution $F(\vartheta) = (1-\lambda)\delta_{h/(1-\lambda)} + \lambda\delta_{-h/\lambda}$. In the
first experiment we set $\lambda = 1/3$, as in the original Chen example, in the second experiment
we set $\lambda = 1/20$. We consider four tests: (i) the $C(\alpha)$ as described in Example 3.8, (ii.) a
parametric version of the LRT in which only the value of $h$ is assumed to be unknown and

                  cval(.90)                       cval(.95)                       cval(.99)
n         [-1,1] [-2,2] [-3,3] [-4,4]     [-1,1] [-2,2] [-3,3] [-4,4]     [-1,1] [-2,2] [-3,3] [-4,4]
100        2.09   2.69   2.80   2.80       3.07   3.70   3.97   4.06       6.43   7.58   8.31   8.55
500        2.22   2.80   2.96   2.98       3.06   3.87   4.41   4.41       5.69   7.07   7.45   7.52
1,000      2.67   3.46   3.72   3.76       3.73   4.95   5.44   5.56       7.26   8.55   9.51   9.76
5,000      2.68   3.56   3.91   3.96       3.79   4.54   4.83   5.09       6.52   8.15   8.32   8.38
10,000     2.41   3.11   3.29   3.46       3.61   4.45   4.72   4.97       6.23   7.51   7.96   8.32
∞          2.75   3.90   5.34   6.38       3.95   5.37   6.87   8.32       6.93   8.71  10.46  11.91

Table 2. Critical Values for Likelihood Ratio Test of Gaussian Parameter
Homogeneity: The first five rows of the table report empirical critical values
based on 1000 replications of the LRT based on the Kiefer-Wolfowitz
estimate of the nonparametric Gaussian location mixture distribution. Results
for sample sizes 5,000 and 10,000 were computed by binning the observations
into 300, 500 equally spaced bins respectively. Restriction of the domain of
the mixing distribution is indicated by the column labels. The last row
reproduces the simulated asymptotic critical values reported in Table 1.



the relative probabilities associated with the two mass points are known; this enables us
to relatively easily find the MLE, $\hat h$, by separately optimizing the likelihood on the positive
and negative half-line and taking the best of the two solutions, (iii.) the Kiefer-Wolfowitz
LRT computed with equally spaced binning on the support of the sample, and finally as a
benchmark (iv.) the classical Kolmogorov-Smirnov test of normality. The sample size in
all the power comparisons was taken to be 200, with 10,000 replications. We consider 21 
distinct values of h for each of the experiments equally spaced on the respective plotting 
regions. 
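
A scaled-down version of the power experiment for the $C(\alpha)$ test alone can be sketched as follows. It draws samples from the Chen mixture, applies the test of Example 3.8 with the sample mean plugged in, and reports the rejection frequency. The function names, the seed, and the much smaller default number of replications (chosen for speed, not the 10,000 used in the text) are assumptions made for the illustration.

```python
import math
import random
from statistics import NormalDist

def draw_chen_sample(n, h, lam, rng):
    """X_i = theta_i + N(0,1) noise, where theta_i = h/(1-lam) with
    probability 1-lam and theta_i = -h/lam with probability lam."""
    out = []
    for _ in range(n):
        theta = h / (1.0 - lam) if rng.random() < 1.0 - lam else -h / lam
        out.append(theta + rng.gauss(0.0, 1.0))
    return out

def calpha_rejects(x, z_crit):
    """One-sided rejection rule Z_n > z_crit, equivalent to (0 v Z_n)^2 > z_crit^2."""
    n = len(x)
    xbar = sum(x) / n
    z = sum((xi - xbar) ** 2 - 1.0 for xi in x) / math.sqrt(2.0 * n)
    return z > z_crit

def rejection_rate(h, lam, n=200, reps=1000, alpha=0.05, seed=1):
    """Monte Carlo rejection frequency of the C(alpha) test at a given h."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha)
    hits = sum(calpha_rejects(draw_chen_sample(n, h, lam, rng), z_crit)
               for _ in range(reps))
    return hits / reps
```

Calling `rejection_rate(0.0, 1/3)` estimates the size of the test, and evaluating it over a range of `h` values traces out an empirical power curve of the kind plotted in the figures.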

In the left panel of Figure 1 we illustrate the results for the first experiment with $\lambda = 1/3$:
$C(\alpha)$ and the parametric LRT are essentially indistinguishable in this experiment, and
both have slightly better performance than the nonparametric LRT. All three of these tests
perform substantially better than the Kolmogorov-Smirnov test. In the right panel of Figure
1 we have results of another version of the Chen example, except that now $\lambda = 1/20$, so the
mixing distribution is much more skewed. Still $C(\alpha)$ does well for small values of $h$, but for
$h > 0.07$ the two LRT procedures, which are now essentially indistinguishable, dominate.
Again, the KS test performance is poor compared to the other tests explicitly designed for 
the mixture setting. 

In Figure 2 we illustrate the results of two additional experiments, both of which are
based on mixing distributions with densities with respect to Lebesgue measure. On the left
we consider $F(\vartheta,h) = I(-h < \vartheta < h)/(2h)$. Again, we can reduce the parametric LRT to
optimizing separately over the positive and negative half-lines to compute the MLE, $\hat h$. This
would seem to give the parametric LRT a substantial advantage over the Kiefer-Wolfowitz
nonparametric MLE; however, as is clear from the figure, there is little difference in their
performance. Again, the $C(\alpha)$ test is somewhat better than either of the LRTs, but the
difference is modest. In the right panel of Figure 2 we have a similar setup, except that
now the mixing distribution is Gaussian with scale parameter $h$, and again the ordering is
very similar to the uniform mixing case. In both of the latter experiments, the parametric
LRT is somewhat undersized at the null, so we made an empirical size adjustment to the two
LRT curves. In all four figures the KW-LRT has been similarly size adjusted according to
its performance under the null in the respective experiments.



[Figure 1: two panels, titled "Power Comparison - Chen I" (left) and "Power Comparison - Chen II" (right).]

Figure 1. Power Comparison of Several Tests of Parameter Homogeneity:
The left panel illustrates empirical power curves for four tests of parameter
homogeneity for the Chen (1995) mixture with $\lambda = 1/3$, in the right panel
we illustrate the power curves for the same four tests for the Chen mixture
with $\lambda = 1/20$. Note that in the more extreme (right) setting, the LRTs
outperform the $C(\alpha)$ test.



5. Conclusion 

We have seen that the Neyman $C(\alpha)$ test provides a simple, powerful, albeit irregular,
strategy for constructing tests of parameter homogeneity. Many examples of such tests
already appear in the literature; however, the LeCam apparatus provides a unified approach
for studying their asymptotic behavior that enables us to relax moment conditions employed
in prior work. In contrast, likelihood ratio testing for mixture models has been somewhat
inhibited by its apparent computational difficulty, as well as the complexity of its
asymptotic theory. Recent developments in convex optimization have dramatically reduced
the computational effort of earlier EM methods, and new theoretical developments have
led to practical simulation methods for large sample critical values for the Kiefer-Wolfowitz
nonparametric version of the LRT. Local asymptotic optimality of the $C(\alpha)$ test assures
that it is highly competitive in most circumstances, but we have illustrated at least one case
where the LRT has a slight edge. The two approaches are complementary; clearly there
is little point in testing for heterogeneity if there is no mechanism for estimating models 
under the alternative. Since parametric mixture models are notoriously tricky to estimate, 
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Figure 2. Power Comparison of Several Tests of Parameter Homogeneity: 
The left panel illustrates empirical power curves for four tests of parameter 
homogeneity for uniform mixtures of Gaussians with i? on [— h, h], on the 
right panel the same four power curves are depicted for Gaussian mixtures 
of Gaussians with standard deviation h. 



it is a remarkable fact that the nonparametric formulation of the MLE problem a la Kiefer- 
Wolfowitz can be solved quite efficiently - even for large sample sizes by binning - and 
effectively used as an alternative testing procedure. We hope that these new developments 
will encourage others to explore these methods. 
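To make the binning idea concrete, the following sketch fits mixing weights on a fixed grid to binned data. It deliberately uses the older EM fixed-point iteration, not the convex optimization approach advocated here, because EM can be written with the standard library alone. The Gaussian location kernel with unit variance, the evaluation of the likelihood at bin midpoints, the grid, the bin count, and all function names are assumptions made for illustration.

```python
import math
import random

def phi(z):
    """Standard Gaussian density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def bin_data(xs, n_bins):
    """Collapse observations to (midpoint, count) pairs; assumes max(xs) > min(xs)."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in xs:
        counts[min(n_bins - 1, int((x - lo) / width))] += 1
    mids = [lo + (j + 0.5) * width for j in range(n_bins)]
    return [(m, c) for m, c in zip(mids, counts) if c > 0]

def npmle_em(binned, grid, iters=100):
    """EM updates for mixing weights f on a fixed grid; returns weights and log-likelihood path."""
    n = sum(c for _, c in binned)
    A = [[phi(m - u) for u in grid] for m, _ in binned]  # kernel matrix
    f = [1.0 / len(grid)] * len(grid)
    logliks = []
    for _ in range(iters):
        mix = [sum(a * w for a, w in zip(row, f)) for row in A]
        logliks.append(sum(c * math.log(g) for (_, c), g in zip(binned, mix)))
        f = [w * sum(c * row[j] / g for (_, c), row, g in zip(binned, A, mix)) / n
             for j, w in enumerate(f)]
    return f, logliks

rng = random.Random(7)
xs = [rng.choice((-1.5, 1.5)) + rng.gauss(0.0, 1.0) for _ in range(2000)]
binned = bin_data(xs, 100)
grid = [min(xs) + j * (max(xs) - min(xs)) / 49 for j in range(50)]
f, ll = npmle_em(binned, grid)
print(round(sum(f), 6), ll[-1] - ll[0])
```

After binning, the cost per iteration depends on the number of occupied bins rather than on the sample size, which is what makes large samples manageable. The same binned likelihood could be handed to a convex solver instead of EM.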
Appendix A. Technical details 



Proof of (i). Given a measure $G \in \mathcal{P}_M$, $G \neq \delta_0$, define $V(G) := \sum_{k=2}^{\infty} \kappa_k^2(G)/k!$. Also, 
define for $n \in \mathbb{N}$ and $a \in [-N, N]$ the probability measure $G_n := p_n \delta_{c_n} + (1 - p_n) G$ with 
$p_n := 1 - V(G)/n$ and $c_n := \frac{1 - p_n}{p_n} (a - \kappa_1(G))$ [the dependence of $p_n$, $c_n$ on $G$ is suppressed 
in the notation]. Note that for $n$ sufficiently large we have $G_n \in \mathcal{P}_M$ for all $a \in [-N, N]$. 
Moreover, by construction $\kappa_1(G_n) = a(1 - p_n)$ and 

\[
\kappa_k(G_n) = \kappa_k(G)(1 - p_n) + (1 - p_n) \left( \frac{1 - p_n}{p_n} \right)^{k-1} (a - \kappa_1(G))^k
\]

for $n \in \mathbb{N}$. This implies for $n$ sufficiently large we have a.s. 



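\[
\left| \sum_{k=2}^{\infty} \frac{Y_k \kappa_k(G)}{(k!)^{1/2}} - \frac{1}{1 - p_n} \sum_{k=2}^{\infty} \frac{Y_k \kappa_k(G_n)}{(k!)^{1/2}} \right|
\le \frac{1 - p_n}{p_n} \sum_{k=2}^{\infty} \frac{|Y_k| C^k}{(k!)^{1/2}} \left( \frac{1 - p_n}{p_n} \right)^{k-2}
\le \frac{2 C^2 V(G)}{n} \sum_{k=2}^{\infty} \frac{|Y_k| C^{k-2}}{(k!)^{1/2}}
\]

and 

\[
\left| a^2 + \sum_{k=2}^{\infty} \frac{\kappa_k^2(G)}{k!} - \frac{1}{(1 - p_n)^2} \sum_{k=1}^{\infty} \frac{\kappa_k^2(G_n)}{k!} \right| \le \frac{\tilde C V(G)}{n}
\]

for finite constants $C$, $\tilde C$ depending only on $N$ but not on $a$ and $G$ [note that $G \in \mathcal{P}_M$ has 
support contained in $[L, U]$]. Thus for every $N < \infty$, $\varepsilon > 0$ we have with probability at 
least $1 - \varepsilon$ [this follows by choosing $n$ sufficiently large] 

\[
\sup_{G \in \mathcal{P}_M} \frac{\sum_{k=1}^{\infty} Y_k \kappa_k(G)/(k!)^{1/2}}{\left( \sum_{k=1}^{\infty} \kappa_k^2(G)/k! \right)^{1/2}}
\ge \sup_{a \in [-N, N]} \sup_{G \in \mathcal{P}_M} \frac{a Y_1 + \sum_{k=2}^{\infty} Y_k \kappa_k(G)/(k!)^{1/2}}{\left( a^2 + \sum_{k=2}^{\infty} \kappa_k^2(G)/k! \right)^{1/2}} - \varepsilon.
\]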
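Next, observe that $N$ can be chosen so large that with probability at least $1 - f(\varepsilon)$ 

\[
\sup_{a \in \mathbb{R} \setminus [-N, N]} \sup_{G \in \mathcal{P}_M} \frac{a Y_1 + \sum_{k=2}^{\infty} Y_k \kappa_k(G)/(k!)^{1/2}}{\left( a^2 + \sum_{k=2}^{\infty} \kappa_k^2(G)/k! \right)^{1/2}} \le |Y_1| + \varepsilon,
\]

where $f(a) \to 0$ for $a \to 0$. Finally, note that 

\[
\sup_{G \in \mathcal{P}_M} \frac{\sum_{k=1}^{\infty} Y_k \kappa_k(G)/(k!)^{1/2}}{\left( \sum_{k=1}^{\infty} \kappa_k^2(G)/k! \right)^{1/2}} \ge |Y_1| \quad \text{a.s.}
\]

[consider the sequence of measures $G_n = \delta_{\operatorname{sign}(Y_1)/n} \in \mathcal{P}_M$]. 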

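Summarizing the findings above, we have shown that for any $\varepsilon > 0$ we have with probability 
arbitrarily close to one: 

\[
\sup_{G \in \mathcal{P}_M} \frac{\sum_{k=1}^{\infty} Y_k \kappa_k(G)/(k!)^{1/2}}{\left( \sum_{k=1}^{\infty} \kappa_k^2(G)/k! \right)^{1/2}}
\ge \sup_{a \in \mathbb{R}} \sup_{G \in \mathcal{P}_M} \frac{a Y_1 + \sum_{k=2}^{\infty} Y_k \kappa_k(G)/(k!)^{1/2}}{\left( a^2 + \sum_{k=2}^{\infty} \kappa_k^2(G)/k! \right)^{1/2}} - \varepsilon.
\]

By letting $\varepsilon \to 0$ the above can be turned into an almost sure inequality with no $\varepsilon$ on the 
right-hand side. Finally, setting $a = \kappa_1(G)$ we see that the converse inequality also holds 
almost surely. Thus we have shown that 

\[
\sup_{G \in \mathcal{P}_M} \frac{\sum_{k=1}^{\infty} Y_k \kappa_k(G)/(k!)^{1/2}}{\left( \sum_{k=1}^{\infty} \kappa_k^2(G)/k! \right)^{1/2}}
= \sup_{a \in \mathbb{R}} \sup_{G \in \mathcal{P}_M} \frac{a Y_1 + \sum_{k=2}^{\infty} Y_k \kappa_k(G)/(k!)^{1/2}}{\left( a^2 + \sum_{k=2}^{\infty} \kappa_k^2(G)/k! \right)^{1/2}} \quad \text{a.s.}
\]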



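Define $\beta_k := \kappa_k(G)/(k!)^{1/2}$. Fix a realization of $Y_1, Y_2, \ldots$. First, observe that it suffices to consider 
the supremum over $G \in \mathcal{P}_M$ with $\sum_{k=2}^{\infty} Y_k \beta_k \ge 0$. Fixing $G \in \mathcal{P}_M$ shows that in the case 
$\sum_{k=2}^{\infty} Y_k \beta_k > 0$ the supremum with respect to $a$ on the right-hand side above is attained for 
$a^* = Y_1 \sum_{k=2}^{\infty} \beta_k^2 / \sum_{k=2}^{\infty} Y_k \beta_k$, and plugging this into the equation above we obtain (after some simple 
algebra) 

\[
\sup_{a \in \mathbb{R}} \frac{a Y_1 + \sum_{k=2}^{\infty} Y_k \beta_k}{\left( a^2 + \sum_{k=2}^{\infty} \beta_k^2 \right)^{1/2}}
= \left( Y_1^2 + \frac{\left( \sum_{k=2}^{\infty} Y_k \beta_k \right)^2}{\sum_{k=2}^{\infty} \beta_k^2} \right)^{1/2}
\]

for every $G \in \mathcal{P}_M$ with $\sum_{k=2}^{\infty} Y_k \beta_k > 0$. In the case $\sum_{k=2}^{\infty} Y_k \beta_k = 0$ we obtain $a^* = \operatorname{sign}(Y_1) \cdot \infty$. 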
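The maximization over $a$ for fixed $G$ can be checked numerically. In the sketch below, $B$ stands for $\sum_{k \ge 2} Y_k \beta_k$ and $S$ for $\sum_{k \ge 2} \beta_k^2$; this shorthand, the numerical values, and the function names are introduced here purely for illustration. Setting the derivative of $(a Y_1 + B)/(a^2 + S)^{1/2}$ to zero gives $Y_1 S - a B = 0$, and the code compares the resulting closed form with a brute-force grid search.

```python
import math

def ratio(a, y1, b, s):
    """Objective (a * Y_1 + B) / sqrt(a^2 + S)."""
    return (a * y1 + b) / math.sqrt(a * a + s)

def closed_form(y1, b, s):
    """Stationary point a* = Y_1 * S / B and value sqrt(Y_1^2 + B^2 / S); requires B > 0."""
    return y1 * s / b, math.sqrt(y1 * y1 + b * b / s)

y1, b, s = 0.7, 1.3, 2.0  # arbitrary illustrative values with B > 0
a_star, value = closed_form(y1, b, s)
# Brute-force search over a in [-50, 50] with step 0.001.
grid_max = max(ratio(-50.0 + 0.001 * j, y1, b, s) for j in range(100001))
print(a_star, value, grid_max)
```

The grid maximum should fall just below the closed-form value, since the exact maximizer generally lies between grid points.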
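Summarizing the above arguments yields 

\[
\sup_{G \in \mathcal{P}_M} \frac{\sum_{k=1}^{\infty} Y_k \kappa_k(G)/(k!)^{1/2}}{\left( \sum_{k=1}^{\infty} \kappa_k^2(G)/k! \right)^{1/2}}
= \left( Y_1^2 + \sup_{G \in \mathcal{P}_M} \frac{\left( \left( \sum_{k=2}^{\infty} Y_k \beta_k \right)_+ \right)^2}{\sum_{k=2}^{\infty} \beta_k^2} \right)^{1/2}
\]

and this directly implies (i). □ 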
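The moment identities for the perturbed measure used at the start of the proof can be verified numerically. The sketch assumes a discrete $G$ with three atoms and treats the mixing weight $p$ as a free number in $(0, 1)$ instead of computing it from $V(G)/n$; the atoms, weights, values of $a$ and $p$, and function names are hypothetical. It checks that the moments of $p\,\delta_c + (1 - p) G$, with $c = \frac{1 - p}{p}(a - \kappa_1(G))$, equal $(1 - p)\kappa_k(G) + (1 - p)\left(\frac{1 - p}{p}\right)^{k-1}(a - \kappa_1(G))^k$.

```python
def moment(atoms, weights, k):
    """k-th moment kappa_k of a discrete measure."""
    return sum(w * t ** k for t, w in zip(atoms, weights))

def perturb(atoms, weights, a, p):
    """Atoms and weights of p * delta_c + (1 - p) * G, with c = (1 - p) / p * (a - kappa_1(G))."""
    c = (1.0 - p) / p * (a - moment(atoms, weights, 1))
    return [c] + list(atoms), [p] + [(1.0 - p) * w for w in weights]

def predicted_moment(atoms, weights, a, p, k):
    """Closed-form expression for the k-th moment of the perturbed measure."""
    r = (1.0 - p) / p
    d = a - moment(atoms, weights, 1)
    return (1.0 - p) * moment(atoms, weights, k) + (1.0 - p) * r ** (k - 1) * d ** k

atoms, weights = [-0.5, 0.2, 1.0], [0.3, 0.5, 0.2]  # hypothetical G
a, p = 0.7, 0.9                                     # hypothetical a and p_n
g_atoms, g_weights = perturb(atoms, weights, a, p)
for k in range(1, 6):
    print(k, moment(g_atoms, g_weights, k), predicted_moment(atoms, weights, a, p, k))
```

For $k = 1$ the closed form collapses to $a(1 - p)$, which is why the first moment of the perturbed measure can be steered to any target $a$ while the higher moments change only by terms of order $1 - p$.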



